[SPARK-51119][SQL] Readers on executors resolving EXISTS_DEFAULT should not call catalogs #49840
Conversation
Leaving some explanations
 *
 * VisibleForTesting
 */
def analyzeExistingDefault(field: StructField,
This is a simpler version of analyze, used for existsDefaultValues (which is called from executors). I made it a separate method, as we may have an opportunity to simplify it further, but for now it seems some part of analysis is still needed to resolve functions like array(). The problematic FinishAnalysis code was removed, though.
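As a quick illustration of the kind of EXISTS_DEFAULT text that still needs this lightweight resolution (a sketch added for clarity, not code from the patch):

```scala
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

// "array()" is valid EXISTS_DEFAULT text, but it parses to an UnresolvedFunction, so a
// built-in function lookup is still needed before it can be evaluated as a literal default.
val parsed = CatalystSqlParser.parseExpression("array()")
assert(!parsed.resolved) // unresolved until the function is looked up in FunctionRegistry.builtin
```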
/**
 * Visible for testing
 */
def setAnalyzerAndOptimizer(analyzer: Analyzer, optimizer: Optimizer): Unit = {
It's hard to reproduce the issue in a unit test, so I ended up mocking these members to verify that the catalogs are not called.
Let me know; I'm totally fine with removing this if we think the existing test coverage is OK.
…ld not call catalogs
ccd071e to ce67d1d (Compare)
ce67d1d to 2fa0b0b (Compare)
Had a chat offline with @cloud-fan, who suggested simplifying the analyzeExistence method to just the following bare-bones code to resolve functions.
Thanks for the suggestion, this simplifies it a lot.
CatalystSqlParser.parseExpression(sql).transformUp {
  case u: UnresolvedFunction =>
    assert(u.nameParts.length == 1)
    assert(!u.isDistinct)
Can we check other fields in UnresolvedFunction as well? The SQL string produced by Literal#sql should not specify any extra fields.
Done, I added a bunch of asserts. I'm not too familiar with these flags, please double-check.
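For reference, roughly the shape of the extra checks being discussed, as an extension of the parse snippet shown above (a sketch of the idea, not the exact asserts in the patch):

```scala
CatalystSqlParser.parseExpression(sql).transformUp {
  // Literal#sql never emits DISTINCT, FILTER, IGNORE NULLS, or internal functions, so every
  // flag on the parsed UnresolvedFunction should still be in its default, unset state.
  case u: UnresolvedFunction =>
    assert(u.nameParts.length == 1)
    assert(!u.isDistinct)
    assert(u.filter.isEmpty)
    assert(!u.ignoreNulls)
    assert(!u.isInternal)
    FunctionRegistry.builtin.lookupFunction(FunctionIdentifier(u.nameParts.head), u.arguments)
}
```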
"", field.name, defaultSQL, null) | ||
} | ||
|
||
expr |
Shall we still call coerceDefaultValue at the end?
It shouldn't be strictly necessary, but would be a no-op on executors, and we'll be safer to have it, so might as well keep it to double-check that the existence default value is the type we expect.
Added
Thanks for working on this!!
"", field.name, defaultSQL, null) | ||
} | ||
|
||
expr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't be strictly necessary, but would be a no-op on executors, and we'll be safer to have it, so might as well keep it to double-check that the existence default value is the type we expect.
expr match {
  case _: ExprLiteral | _: Cast => expr
}
analyzeExistingDefault(field, text)
Unit testing idea: we could try to stub out the catalog manager with something that always throws errors, if we wanted to be extra careful. Our DefaultColumnAnalyzer currently looks like this:
/**
* This is an Analyzer for processing default column values using built-in functions only.
*/
object DefaultColumnAnalyzer extends Analyzer(
  new CatalogManager(BuiltInFunctionCatalog, BuiltInFunctionCatalog.v1Catalog)) {
}
We could use a different catalog manager that always throws errors there, specifically when we exercise this getExistenceDefaultValues method. This would make sure we're not inadvertently doing catalog manager lookups there anymore.
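A minimal sketch of that idea (the ThrowingCatalogManager/ThrowingCatalogAnalyzer names are made up, BuiltInFunctionCatalog is the one referenced in the snippet above, and this assumes CatalogManager's lookup methods can be overridden from test code):

```scala
import org.apache.spark.sql.catalyst.analysis.Analyzer
import org.apache.spark.sql.connector.catalog.{CatalogManager, CatalogPlugin}

// A catalog manager that fails loudly on any lookup: exercising getExistenceDefaultValues with
// an analyzer built on top of it would surface any accidental catalog access in the read path.
object ThrowingCatalogManager extends CatalogManager(
    BuiltInFunctionCatalog, BuiltInFunctionCatalog.v1Catalog) {
  override def catalog(name: String): CatalogPlugin =
    throw new IllegalStateException(s"unexpected catalog lookup: $name")
  override def currentCatalog: CatalogPlugin =
    throw new IllegalStateException("unexpected currentCatalog access")
}

object ThrowingCatalogAnalyzer extends Analyzer(ThrowingCatalogManager)
```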
LGTM after resolving remaining comments. Thanks again for the fix!
+1, LGTM. Thank you, @szehon-ho, @cloud-fan, @dtenedor. (Pending CIs).
…ld not call catalogs

### What changes were proposed in this pull request?

Simplify the resolution of EXISTS_DEFAULT in ResolveDefaultColumns::getExistenceDefaultValues(), which is called from file readers on executors.

### Why are the changes needed?

Spark executors unnecessarily contact catalogs when resolving EXISTS_DEFAULT (used for default values for existing data) for a column.

Detailed explanation: the code path for default values first runs an analysis of the user-provided CURRENT_DEFAULT value for a column (to evaluate functions, etc.) and saves the resulting SQL as the column's EXISTS_DEFAULT. EXISTS_DEFAULT is then used to avoid having to rewrite existing data via backfill to fill this value into the files. When reading existing files, Spark then attempts to resolve the EXISTS_DEFAULT metadata and uses the value for null values it finds in that column.

The problem is that this second step on read redundantly runs all the analyzer and finish-analysis rules again on EXISTS_DEFAULT, some of which contact the catalog unnecessarily. Those rules are redundant, as they were already run to produce the value.

Worse, it may cause exceptions if the executors are not configured properly to reach the catalog, such as:

```
Caused by: org.apache.spark.SparkException: Failed during instantiating constructor for catalog 'spark_catalog': org.apache.spark.sql.delta.catalog.DeltaCatalog.
  at org.apache.spark.sql.errors.QueryExecutionErrors$.failedToInstantiateConstructorForCatalogError(QueryExecutionErrors.scala:2400)
  at org.apache.spark.sql.connector.catalog.Catalogs$.load(Catalogs.scala:84)
  at org.apache.spark.sql.connector.catalog.CatalogManager.loadV2SessionCatalog(CatalogManager.scala:72)
  at org.apache.spark.sql.connector.catalog.CatalogManager.$anonfun$v2SessionCatalog$2(CatalogManager.scala:94)
  at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86)
  at org.apache.spark.sql.connector.catalog.CatalogManager.$anonfun$v2SessionCatalog$1(CatalogManager.scala:94)
  at scala.Option.map(Option.scala:230)
  at org.apache.spark.sql.connector.catalog.CatalogManager.v2SessionCatalog(CatalogManager.scala:93)
  at org.apache.spark.sql.connector.catalog.CatalogManager.catalog(CatalogManager.scala:55)
  at org.apache.spark.sql.connector.catalog.CatalogManager.currentCatalog(CatalogManager.scala:130)
  at org.apache.spark.sql.connector.catalog.CatalogManager.currentNamespace(CatalogManager.scala:101)
  at org.apache.spark.sql.catalyst.optimizer.ReplaceCurrentLike.apply(finishAnalysis.scala:172)
  at org.apache.spark.sql.catalyst.optimizer.ReplaceCurrentLike.apply(finishAnalysis.scala:169)
  at org.apache.spark.sql.catalyst.optimizer.Optimizer$FinishAnalysis$.$anonfun$apply$1(Optimizer.scala:502)
  at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
  at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
  at scala.collection.immutable.List.foldLeft(List.scala:91)
  at org.apache.spark.sql.catalyst.optimizer.Optimizer$FinishAnalysis$.apply(Optimizer.scala:502)
  at org.apache.spark.sql.catalyst.util.ResolveDefaultColumns$.analyze(ResolveDefaultColumnsUtil.scala:301)
  at org.apache.spark.sql.catalyst.util.ResolveDefaultColumns$.analyze(ResolveDefaultColumnsUtil.scala:266)
  at org.apache.spark.sql.catalyst.util.ResolveDefaultColumns$.$anonfun$getExistenceDefaultValues$2(ResolveDefaultColumnsUtil.scala:427)
  at scala.Option.map(Option.scala:230)
  at org.apache.spark.sql.catalyst.util.ResolveDefaultColumns$.$anonfun$getExistenceDefaultValues$1(ResolveDefaultColumnsUtil.scala:425)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
  at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
  at scala.collection.TraversableLike.map(TraversableLike.scala:286)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
  at org.apache.spark.sql.catalyst.util.ResolveDefaultColumns$.getExistenceDefaultValues(ResolveDefaultColumnsUtil.scala:423)
  at org.apache.spark.sql.catalyst.util.ResolveDefaultColumns$.$anonfun$existenceDefaultValues$2(ResolveDefaultColumnsUtil.scala:498)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.catalyst.util.ResolveDefaultColumns$.existenceDefaultValues(ResolveDefaultColumnsUtil.scala:496)
  at org.apache.spark.sql.catalyst.util.ResolveDefaultColumns.existenceDefaultValues(ResolveDefaultColumnsUtil.scala)
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:350)
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:373)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.$anonfun$apply$5(ParquetFileFormat.scala:441)
  at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1561)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.apply(ParquetFileFormat.scala:428)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.apply(ParquetFileFormat.scala:258)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:639)
  ... 21 more
Caused by: java.lang.IllegalStateException: No active or default Spark session found
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a test in StructTypeSuite. I had to expose some members in ResolveDefaultColumns for testing.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #49840 from szehon-ho/SPARK-51119.

Authored-by: Szehon Ho <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 937decc)
Signed-off-by: Dongjoon Hyun <[email protected]>
According to the affected version, I merged this to master/4.0.
Thanks @cloud-fan @dtenedor @dongjoon-hyun !
    assert(!u.isInternal)
    FunctionRegistry.builtin.lookupFunction(FunctionIdentifier(u.nameParts.head), u.arguments)
} match {
  case c: Cast if c.needsTimeZone =>
The CAST can be nested inside array/map/struct; we should put this case match inside the transformUp, together with the case u: UnresolvedFunction. @szehon-ho can you make a follow-up PR for it?
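A rough sketch of that suggestion (resolveExistsDefault is a made-up name and this is not the actual follow-up patch, #49881; it just moves the Cast case inside the same transformUp so nested casts are covered too):

```scala
import org.apache.spark.sql.catalyst.FunctionIdentifier
import org.apache.spark.sql.catalyst.analysis.{FunctionRegistry, UnresolvedFunction}
import org.apache.spark.sql.catalyst.expressions.Cast
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.internal.SQLConf

// Resolve built-in functions and attach a time zone to timezone-less casts, even when the
// cast is nested inside array/map/struct, by handling both cases inside one transformUp.
def resolveExistsDefault(sql: String) =
  CatalystSqlParser.parseExpression(sql).transformUp {
    case u: UnresolvedFunction =>
      assert(u.nameParts.length == 1)
      FunctionRegistry.builtin.lookupFunction(FunctionIdentifier(u.nameParts.head), u.arguments)
    case c: Cast if c.needsTimeZone =>
      c.withTimeZone(SQLConf.get.sessionLocalTimeZone)
  }
```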
@cloud-fan sure, let me do that.
BTW, I looked a little bit and couldn't reproduce a failure with the current implementation using the following unit test with a nested cast:
test("SPARK-51119: array of timestamp should have timezone if default values castable") {
withTable("t") {
sql(s"CREATE TABLE t(key int, c ARRAY<STRING> DEFAULT " +
s"ARRAY(CAST(timestamp '2018-11-17' AS STRING))) " +
s"USING parquet")
sql("INSERT INTO t (key) VALUES(1)")
checkAnswer(sql("select * from t"), Row(1, Array("2018-11-17 00:00:00")))
}
}
Unlike the failing case of a top-level cast:
test("SPARK-46958: timestamp should have timezone for resolvable if default values castable") {
val defaults = Seq("timestamp '2018-11-17'", "CAST(timestamp '2018-11-17' AS STRING)")
defaults.foreach { default =>
withTable("t") {
sql(s"CREATE TABLE t(key int, c STRING DEFAULT $default) " +
s"USING parquet")
sql("INSERT INTO t (key) VALUES(1)")
checkAnswer(sql("select * from t"), Row(1, "2018-11-17 00:00:00"))
}
}
}
EXISTS_DEFAULT is saved without a cast in the first case: ARRAY('2018-11-17 00:00:00') (looks like it got evaluated), and with a cast in the second case: CAST(TIMESTAMP '2018-11-17 00:00:00' AS STRING).
So I think in this particular scenario it doesn't matter, but I agree it is better to have it, as we are making a generic method.
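One way to inspect what actually got persisted for a column, shown as a sketch (it assumes an active SparkSession named spark and the CURRENT_DEFAULT / EXISTS_DEFAULT column-metadata keys used by ResolveDefaultColumns):

```scala
// Look at the default-value metadata stored on column c of table t.
val field = spark.table("t").schema("c")
println(field.metadata.getString("CURRENT_DEFAULT")) // the user-supplied default expression text
println(field.metadata.getString("EXISTS_DEFAULT"))  // e.g. ARRAY('2018-11-17 00:00:00') above
```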
I'm looking at the previous test failure:
Cause: org.apache.spark.sql.AnalysisException: [INVALID_DEFAULT_VALUE.UNRESOLVED_EXPRESSION] Failed to execute command because the destination column or variable `c` has a DEFAULT value CAST(TIMESTAMP '2018-11-17 00:00:00' AS STRING), which fails to resolve as a valid expression. SQLSTATE: 42623
CAST(TIMESTAMP '2018-11-17 00:00:00' AS STRING) can't be generated by Literal#sql. Seems we have some misunderstanding about how this existing default string is generated. @szehon-ho can you take a closer look?
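For contrast, a tiny illustration of what Literal#sql does produce for an already-evaluated string default (added for clarity, not part of the patch):

```scala
import org.apache.spark.sql.catalyst.expressions.Literal

// Literal#sql renders a plain SQL literal; it never wraps the value in a CAST.
println(Literal("2018-11-17 00:00:00").sql) // '2018-11-17 00:00:00'
```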
synced offline, see the other comment.
@@ -320,6 +319,29 @@ object ResolveDefaultColumns extends QueryErrorsBase
    coerceDefaultValue(analyzed, dataType, statementType, colName, defaultSQL)
I think the CAST is added here, but it should be constant-folded before we generate the existing default string. We need to debug it.
Synced with @cloud-fan offline: this is not constant-folded after this line when analyzing to create EXISTS_DEFAULT. So in the input of analyzeExistsDefault(), EXISTS_DEFAULT sometimes has a top-level CAST.
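A minimal illustration of that point (simplified and assumed, not Spark's actual code path; it only shows how a CAST around the analyzed default renders and that it would have been foldable):

```scala
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.types.StringType

// coerceDefaultValue can wrap the analyzed default in a Cast when its type differs from the
// column type; nothing folds that Cast before the SQL text is persisted as EXISTS_DEFAULT.
val analyzed = Literal(java.sql.Timestamp.valueOf("2018-11-17 00:00:00"))
val coerced  = Cast(analyzed, StringType)
println(coerced.sql)      // CAST(TIMESTAMP '2018-11-17 00:00:00' AS STRING)
println(coerced.foldable) // true, so it could have been folded down to a plain string literal
```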
Synced with @cloud-fan offline, the current code should work for this case. Made follow-up #49881 to do some cleanup and put the logic in the right place.
…EFAULT should not call catalogs

### What changes were proposed in this pull request?

Code cleanup for #49840. Literal#fromSQL should be the inverse of Literal#sql. The cast handling is an artifact of the calling ResolveDefaultColumns object that adds the cast when making EXISTS_DEFAULT, so its handling is moved to ResolveDefaultColumns as well.

### Why are the changes needed?

Code cleanup

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #49881 from szehon-ho/SPARK-51119-follow.

Authored-by: Szehon Ho <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…EFAULT should not call catalogs (same commit message as above; cherry picked from commit 5135329, Signed-off-by: Wenchen Fan <[email protected]>)
What changes were proposed in this pull request?
Simplify the resolution of EXISTS_DEFAULT in ResolveDefaultColumns::getExistenceDefaultValues(), which is called from file readers on executors.
Why are the changes needed?
Spark executors unnecessarily contact catalogs when resolving EXISTS_DEFAULT (used for default values for existing data) for a column.
Detailed explanation: the code path for default values first runs an analysis of the user-provided CURRENT_DEFAULT value for a column (to evaluate functions, etc.) and saves the resulting SQL as the column's EXISTS_DEFAULT. EXISTS_DEFAULT is then used to avoid having to rewrite existing data via backfill to fill this value into the files. When reading existing files, Spark then attempts to resolve the EXISTS_DEFAULT metadata and uses the value for null values it finds in that column.
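A small end-to-end illustration of that flow (a sketch with a made-up table and default; it assumes a SparkSession named spark and that the parquet provider allows default columns):

```scala
spark.sql("CREATE TABLE t (key INT) USING parquet")
spark.sql("INSERT INTO t VALUES (1)") // data written before column c exists
// CURRENT_DEFAULT for c is the expression concat('a', 'b'); its analyzed result is stored as
// EXISTS_DEFAULT so the already-written file does not need to be rewritten (no backfill).
spark.sql("ALTER TABLE t ADD COLUMN c STRING DEFAULT concat('a', 'b')")
// On read, the reader fills c with the EXISTS_DEFAULT value ('ab') for the pre-existing row.
spark.sql("SELECT * FROM t").show()
```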
The problem is that this second step on read redundantly runs all the analyzer and finish-analysis rules again on EXISTS_DEFAULT, some of which contact the catalog unnecessarily. Those rules are redundant, as they were already run to produce the value.
Worse, it may cause exceptions if the executors are not configured properly to reach the catalog, such as the spark_catalog instantiation failure shown in the stack trace in the merge commit message above.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added a test in StructTypeSuite. I had to expose some members in ResolveDefaultColumns for testing.
Was this patch authored or co-authored using generative AI tooling?
No